Back Razor: Memory-Efficient Transfer Learning by Self-Sparsified Backpropagation
Transfer learning from models trained on large datasets to customized downstream tasks has been widely used, as the pre-trained model can greatly boost generalizability. However, the increasing size of pre-trained models also leads to a prohibitively large memory footprint for downstream transfer, making them unaffordable for personal devices. Previous work recognizes the activations as the bottleneck of this footprint and hence proposes solutions such as injecting specific lite modules. In this work, we present a novel memory-efficient transfer framework called Back Razor that can be applied plug-and-play to any pre-trained network without changing its architecture. The key idea of Back Razor is asymmetric sparsifying: pruning the activations stored for backpropagation while keeping the forward activations dense. It is based on the observation that the stored activations, which dominate the memory footprint, are only needed for backpropagation. Such asymmetric pruning avoids affecting the precision of the forward computation, thus making more aggressive pruning possible. Furthermore, we conduct a theoretical analysis of the convergence rate of Back Razor, showing that under mild conditions our method retains a convergence rate similar to that of vanilla SGD. Extensive transfer learning experiments on both Convolutional Neural Networks and Vision Transformers with classification, dense prediction, and language modeling tasks show that Back Razor can yield up to 97% sparsity, saving 9.2x memory usage, without losing accuracy.
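A minimal PyTorch sketch of the asymmetric-sparsification idea described in the abstract, shown for a single linear layer: the forward pass uses the dense activation, but only a magnitude-pruned copy is cached for the backward pass. The class name SparseCacheLinear, the top-k magnitude pruning rule, and the sparsity parameter are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

class SparseCacheLinear(torch.autograd.Function):
    """Linear layer whose forward pass is dense, but which caches only a
    magnitude-pruned copy of the input activation for backpropagation."""

    @staticmethod
    def forward(ctx, x, weight, bias, sparsity):
        # Dense forward: the output is computed from the full activation,
        # so forward precision is unaffected by the pruning below.
        out = F.linear(x, weight, bias)
        # Keep only the largest-magnitude entries of x for the backward pass.
        k = max(1, int(x.numel() * (1.0 - sparsity)))
        threshold = x.abs().flatten().topk(k).values.min()
        x_pruned = x * (x.abs() >= threshold)
        # A real implementation would store x_pruned compressed (nonzeros + bitmap);
        # it is kept dense here for clarity.
        ctx.save_for_backward(x_pruned, weight)
        ctx.has_bias = bias is not None
        return out

    @staticmethod
    def backward(ctx, grad_out):
        x_pruned, weight = ctx.saved_tensors
        grad_x = grad_out @ weight            # gradient w.r.t. the input
        grad_w = grad_out.t() @ x_pruned      # weight gradient uses the pruned cache
        grad_b = grad_out.sum(dim=0) if ctx.has_bias else None
        return grad_x, grad_w, grad_b, None

# Hypothetical usage on one layer (x of shape (batch, in_features)):
# out = SparseCacheLinear.apply(x, layer.weight, layer.bias, 0.9)  # 90% sparsity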
Block Selective Reprogramming for On-device Training of Vision Transformers
Sarkar, Sreetama, Kundu, Souvik, Zheng, Kai, Beerel, Peter A.
The ubiquity of vision transformers (ViTs) for various edge applications, including personalized learning, has created a demand for on-device fine-tuning. However, training with the limited memory and computational power of edge devices remains a significant challenge. In particular, the memory required for training is much higher than that needed for inference, primarily due to the need to store activations across all layers in order to compute the gradients needed for weight updates. Previous works have explored reducing this memory requirement via frozen-weight training as well as by storing the activations in a compressed format. However, these methods are deemed inefficient due to their inability to provide training or inference speedup. In this paper, we first investigate the limitations of existing on-device training methods aimed at reducing memory and compute requirements. We then present block selective reprogramming (BSR), in which we fine-tune only a fraction of the total blocks of a pre-trained model and selectively drop tokens based on the self-attention scores of the frozen layers. To show the efficacy of BSR, we present extensive evaluations on ViT-B and DeiT-S with five different datasets. Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x while maintaining similar accuracy. We also showcase results for Mixture-of-Experts (MoE) models, demonstrating the effectiveness of our approach in multitask learning scenarios.
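As a rough illustration of the two ingredients in the abstract, the sketch below freezes all but a chosen fraction of the transformer blocks and keeps only the patch tokens that receive the highest CLS attention from a frozen layer. It assumes a timm-style ViT with a .blocks list and a CLS token at position 0; the function names and the trainable_fraction and keep_ratio parameters are hypothetical, not the authors' code.

import torch

def freeze_all_but_last(model, trainable_fraction=0.25):
    """Freeze every transformer block except the last fraction of them,
    so only a few blocks are fine-tuned on-device."""
    blocks = list(model.blocks)
    n_trainable = max(1, int(len(blocks) * trainable_fraction))
    for block in blocks[:-n_trainable]:
        for p in block.parameters():
            p.requires_grad = False

def drop_tokens_by_cls_attention(tokens, cls_attn, keep_ratio=0.5):
    """Keep the CLS token plus the patch tokens with the highest CLS-attention scores.

    tokens:   (B, 1 + N, D) token sequence with CLS at position 0
    cls_attn: (B, N) attention from CLS to each patch token (e.g. averaged over heads)
    """
    n_keep = max(1, int(cls_attn.shape[1] * keep_ratio))
    idx = cls_attn.topk(n_keep, dim=1).indices                     # (B, n_keep)
    patches = tokens[:, 1:, :]
    kept = patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return torch.cat([tokens[:, :1, :], kept], dim=1)              # (B, 1 + n_keep, D)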